Abstract. Ubiquitous computing environments are characterized by an 
unbounded amount of noise and crosstalk. In these environments, tradi- 
tional methods of sound capture are insufficient, and array microphones 
are needed in order to obtain a clean recording of desired speech. In this 
work, we have designed, implemented, and tested LOUD, a novel 1020- 
node microphone array utilizing the Raw tile parallel processor architec- 
ture [1] for computation. To the best of our knowledge, this is currently 
the largest microphone array in the world. We have explored the uses 
of the array within ubiquitous computing scenarios by implementing an 
acoustic beamforming algorithm for sound source amplification in a noisy 
environment, and have obtained preliminary results demonstrating the 
efficacy of the array. From one to 1020 microphones, we have shown a 
13.7dB increase in peak SNR on a representative utterance, an 87.2% 
drop in word error rate with interferer present, and an 89.6% drop in 
WER without an interferer. 



1 Introduction 

The interaction between humans and computers has been a central focus of ubiq- 
uitous computing research in recent times. In particular, communication through 
speech has been extensively explored as a method for making human-computer 
interaction more natural. However, computer recognition of human speech performs
well only when a recording can be made without the presence of much ambient noise
or crosstalk. Seeking to create a natural setting, ubiquitous computing environ- 
ments tend to fall in this category of situations where natural-interaction speech 
recognition is a challenging problem. When significant levels of noise are present, 
or several humans are talking at the same time, recognition becomes difficult, 
if not impossible, in the absence of an appropriate technology to separate the 
desired speech from the undesired speech and ambient noise. As part of MIT's 
Project Oxygen [2], we have created a modular large microphone array, called 
the Large acOUstic Data, or LOUD Array. 

Recently, arrays of microphones have been increasingly explored as an aid for 
untethered acoustic source selection and amplification. When sound is recorded
in a noisy environment through a single microphone, proximity of the microphone
to the speaker's mouth is essential for audio of sufficient quality for speech recog- 
nition or transmission in a tele-conference. In many situations, this proximity 
cannot be readily achieved, for instance when the recording is taking place in a
conference room environment marred with crosstalk, a machine room with noisy 
fans, or a large auditorium where an audience member is asking a question. In 
these situations, a single microphone at a fixed location in the room cannot sepa- 
rate the voice of one speaker from another, or the voice of the speaker from noise. 
In contrast, arrays of microphones have a spatial extent that can be exploited 
along with the propagation qualities of a sound wave to detect, separate, and am- 
plify speech sources in a noisy environment. Microphone arrays can be "steered" 
in software toward a desired sound source, filtering out undesired sources. When 
an appropriate level of computational power is available, microphone arrays can 
also "track" a desired source around a space as the source moves [3] . 

The LOUD modular microphone array currently consists of 1020 micro- 
phones. Our reasons for building such a large array are twofold. First, the 
performance of a microphone array improves linearly as the size of the array 
grows. This is well established in the theoretical literature on microphone arrays 
(e.g., [4, 5]), and our experimental results in Section 5 confirm this in practice. 
To date, the largest microphone array known to us [6] had been a 512-element
array, and work in microphone arrays of this size has been extremely limited. 
Second, ubiquitous computing applications often involve a large number of si- 
multaneous feeds of streaming data (e.g., video, audio, haptics, etc). The I/O 
bandwidth and computational power necessary to process these streaming data 
have pushed the limits of traditional computer architectures and I/O schemes.
To this end, our lab has been designing a scalable parallel processing architec- 
ture called Raw [1], specifically designed to handle large volumes of streaming 
data, such as that created by a microphone array. Raw belongs to a new class of 
microprocessors called tiled architectures, which are designed to be effective for 
both general purpose and embedded computation. Our microphone array, which 
generates nearly 50 MB of data every second, is an appropriate application to
test the limitations of this architecture. In fact, an acoustic beamforming algo- 
rithm for 1020 microphones is present in the suite of tasks used to evaluate the 
performance of the Raw processor [7] . 

In this paper we first present our architecture for a novel modular 1020-node
microphone array and beamformer utilizing a general-purpose tile processor ar- 
chitecture for computation. The modularity of the array represents the first 
major contribution of this work. Second, we present the results for a prelim- 
inary round of experiments giving speech recognition accuracy rates for data 
collected with the array in a noisy environment, both in and out of the pres- 
ence of an interferer. We show an improvement in speech recognition accuracy 
from 3.0% with one microphone to 87.6% with the full 1020-microphone array 
(87.2% drop in word error rate) when an interferer is present, and from 9.6% 
with one microphone to 90.6% with the full array with no interferer (89.6% drop 
in WER). The SNR improves by 13.7 dB from one to 1020 microphones for a
representative utterance. As the other main contribution of this work, we show
a steady improvement in recognition performance and SNR as the size of the array
is increased to 1020 microphones, thus clearly demonstrating the benefit of
the use of large arrays to record speech in noisy environments. A video demon- 
strating the improvement in sound quality when using the array is available at 
http://cag.csail.mit.edu/mic-array/videos/. Finally, we outline our use of 
the Raw general-purpose tile parallel processor architecture, which, in contrast 
with previously-used DSP chips, allows programmers to write code in conven- 
tional programming languages. Raw is currently able to run an acoustic beam- 
forming algorithm in real-time on all the 1020 simultaneous data streams from 
the array. 

We begin the paper with an overview of related work in Section 2. We then 
outline the details of our microphone array hardware and processing software 
implementation in Section 3. Section 4 presents the setup and methods used in 
our experiments to evaluate the array. In Section 5, we present the results of our 
experiments, and in Section 6, we discuss the results and relate our findings to 
past work. In Section 7, we outline our plans for future work, and conclude the 
paper. 



2 Related Work 

Sensor arrays have been extensively explored in the past half-century, initially as 
a tool for radar-based tracking of objects [8] , and then for a number of other ap- 
plications including radio astronomy [9], sonar systems [10], and seismology [11]. 
Over the past two decades, arrays of microphones (i.e., acoustic sensors in air) 
have been increasingly used for sound source separation and amplification, and 
since the late 1980s have been explored as a tool for capturing audio in difficult 
acoustic environments [12, 13]. 

Microphone arrays have quickly become popular as an aid for speech recognition,
and several recent projects are exploring the use of microphone arrays
for this purpose [14-16]. A number of projects report significant improvements 
in recognition performance when using a microphone array when compared to 
a single omnidirectional microphone. For instance, [14] reports a near three-fold 
decrease in recognition error rates using a circular array of eight microphones 
in a conference room, and [16] reports similar gains with an eight-microphone 
linear array. However, all of these used substantially smaller arrays than the one 
presented in this paper. 

In the recent past, microphone arrays have seen increased exposure in ubiq- 
uitous and multimodal computing applications. For instance, [17] used a two- 
microphone array as part of a multi-modal person tracking system on a mobile 
robot, and [18] used two microphones on a speaker's tie to detect whether speech
was coming from the speaker or another person in the environment. [19] used a 
32-microphone array in conjunction with a computer vision-based person track- 
ing system to selectively amplify speakers in the presence of noise and interfering 
speakers.

The literature [4, 5] states that the performance of microphone arrays will
theoretically continue to scale to a much larger number of microphones. However,
most microphone arrays used in research and industry today have a small number 
of microphones (i.e., fewer than 20). There are, however, a small number of larger
arrays in existence. [19-21] present intermediate-sized arrays of 32, 64, and 64 
microphones, respectively. [22] shows a 400-microphone square array two meters
on a side, but no publications seem to be available about the project. Finally, the 
Huge Microphone Array [6], an array of 512 microphones, has to our knowledge 
been the largest microphone array to date. The researchers of this project have 
designed custom hardware for both sound capture and processing (using DSP 
chips). However, the publications stemming from this work all appear to use a 
16-microphone subset of the large array known as the Brown Megamike. For 
instance, [23] showed improved recognition performance for this smaller array 
size. Based on investigation of past work, it appears that there is little or no 
published work on speech recognition experiments using microphone arrays of 
the scale of the array presented in this work. 

3 Implementation 

This section outlines our implementation of the 1020-node microphone array
and beamformer. We first outline the hardware and firmware design of the array 
components and the connections to the Raw tile processor. We then present the 
array geometry that we have used and our reasons for choosing this geometry. 
Finally, we describe the software algorithms used to process the data recorded 
by the array, and our mapping of these algorithms onto Raw. 

3.1 Hardware 

Our microphone array feeds data into the Raw microprocessor [1], which is a 
parallel tile architecture currently being researched in our lab. The design of the 
Raw 16-tile processor has taken place since 1997, and our lab received the first 
prototype chips in early 2002. Raw is a parallel machine specifically designed 
for applications requiring real-time processing of large amounts of streaming 
data. By exposing the details of the interconnection networks on the chip to the 
software, Raw allows for highly efficient systolic communication on the chip, and 
thus exposes a potential for a great deal of parallel real-time computation. Raw 
provides two static networks and two dynamic networks, the static networks
being more efficient for systolic real-time computation. The static network is
controlled by an entirely independent switch processor on each tile, and the 
routing code that runs on the switch is exposed to the software. In this work, 
we utilize the static networks as outlined in Section 3.3. 

The 1020-node microphone array (Figure 3) consists of 510 two-microphone 
printed circuit boards (PCBs), pictured in Figure 1 (the microphones used were 
Panasonic WM-54BT Electret Condenser microphones). We have opted to create
small microphone modules to ensure LEGO-like modularity in the design
of our array. Each PCB contains two microphones, one stereo A-to-D converter
(Cirrus Logic CS53L32A), and a small CPLD (Xilinx Coolrunner XCR3032XL). 
The A-to-D converter samples at 16 kHz, generating 24-bit serial data for each
microphone. Our decision to place two microphones on one PCB was mainly due 
to the fact that the A-to-D converter is able to accommodate two channels of 
audio. The two-microphone boards are connected in chains of 16 boards (32 mi- 
crophones), and each chain plugs into a connector board. The data are streamed 
through the chain and into the connector board using time-division multiplexing. 
Each connector board takes eight chains, and four connector boards are used to 
accommodate 1024 microphones in total. 




Fig. 1. A two-microphone board from the LOUD array. 



The four connector boards are connected to an expansion connector on the 
Raw parallel processor motherboard via a high-bandwidth micro-coax cable. The 
cable can accommodate up to five connector boards, so 256 additional micro- 
phones could be added with the current configuration. A large FPGA on the Raw 
motherboard (Xilinx 3000E) converts the serial data from the array into packets 
of parallel words. The packets are streamed into Raw on one of its sixteen I/O 
ports. Currently two of the Raw I/O ports are linked to physical connectors on 
the Raw motherboard, meaning a total of 2560 microphones could be accommo- 
dated with the current motherboard; but if more connectors were added or if 
other Raw boards were used, we could theoretically support an arbitrarily large 
number of microphones. 

Figure 2 illustrates the hardware design of the array, the interconnections 
between the array and Raw, and the associated bandwidths at each link. Each of
the 32-microphone chains produces 12.3 Mbits/sec. Each connector board receives
eight microphone chains, or 98.3 Mbits/sec. There are four connector boards, 
meaning the total bandwidth into Raw is 393 Mbits/sec, or 49.1 MBytes/sec. 
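
The link bandwidths quoted above follow directly from the sampling parameters given in this section. The short Python sketch below recomputes them as a sanity check; the constant and function names are illustrative only and are not part of the LOUD firmware or software, and the results agree with the quoted figures to within rounding.

```python
# Figures taken from the text of Section 3.1; all names are illustrative.
SAMPLE_RATE_HZ = 16_000     # A-to-D sampling rate per microphone
BITS_PER_SAMPLE = 24        # serial word size per sample
MICS_PER_CHAIN = 32         # 16 boards x 2 microphones
CHAINS_PER_BOARD = 8        # chains accepted by one connector board
CONNECTOR_BOARDS = 4        # connector boards in the current array


def link_bandwidths():
    """Return (chain, connector board, total) bandwidth in bits per second."""
    chain = MICS_PER_CHAIN * SAMPLE_RATE_HZ * BITS_PER_SAMPLE
    board = chain * CHAINS_PER_BOARD
    total = board * CONNECTOR_BOARDS
    return chain, board, total


chain_bps, board_bps, total_bps = link_bandwidths()
print(f"per chain:           {chain_bps / 1e6:.1f} Mbits/sec")
print(f"per connector board: {board_bps / 1e6:.1f} Mbits/sec")
print(f"into Raw:            {total_bps / 1e6:.0f} Mbits/sec "
      f"({total_bps / 8 / 1e6:.2f} MBytes/sec)")
```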







Fig. 2. A schematic diagram of the LOUD microphone array hardware design, the
connections to the Raw tile processor, and the bandwidths required at each connection.



3.2 Geometry 

Many array geometries have been suggested in past work, from linear to rect- 
angular to circular; and, similarly, many microphone spacing schemes have been 
suggested, from uniform to logarithmic. While many geometrical configurations 
of the array are possible and potentially desirable, our initial 1020-microphone
geometry (pictured in Figure 3) is a rectangular array 60 microphones wide 
by 17 microphones high. This allows us to steer the amplification beam verti- 
cally as well as horizontally. Since microphone arrays sample the signal in both 
space and time, spatial sampling [4] as well as temporal sampling can affect 
the resulting waveform. Arrays spatially sample at a wavelength equal to the
inter-microphone spacing, and any source signal component with a wavelength shorter than
twice the spacing will be aliased (per the Nyquist criterion). For this work, we
have chosen to use uniform spacing at 3 cm (meaning spatially sampling the
waveform at $\frac{342\ \mathrm{m/s}}{0.03\ \mathrm{m}} = 11{,}400$ Hz, so that frequencies above 5,700 Hz will be
aliased). This decision was due both to practical considerations and to preliminary
experiments with various spacings. The 3 cm spacing is maintained in both
the vertical and horizontal directions, by deliberate placement of microphone
boards horizontally on an aluminum plate, and by using spacers of appropriate 
lengths to stack boards vertically. 
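
The relationship between microphone spacing and the spatial aliasing limit can be checked with a few lines of Python. This is only a sketch: the speed of sound of 342 m/s is an assumption, chosen because it is the value consistent with the 11,400 Hz figure at 3 cm spacing, and the function names are hypothetical.

```python
SPEED_OF_SOUND_M_S = 342.0  # assumed; consistent with 11,400 Hz at 3 cm spacing


def spatial_sampling_hz(spacing_m, c=SPEED_OF_SOUND_M_S):
    """Frequency whose wavelength equals the microphone spacing."""
    return c / spacing_m


def spatial_nyquist_hz(spacing_m, c=SPEED_OF_SOUND_M_S):
    """Highest frequency that is not spatially aliased (wavelength = 2 x spacing)."""
    return c / (2.0 * spacing_m)


spacing = 0.03  # 3 cm, as used in the array
print(f"spatial sampling frequency: {spatial_sampling_hz(spacing):.0f} Hz")
print(f"aliasing begins above:      {spatial_nyquist_hz(spacing):.0f} Hz")
```

Halving the spacing would double the aliasing limit, at the cost of halving the physical extent covered by the same number of microphones.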




Fig. 3. A picture of the LOUD 1020-node microphone array 



3.3 Software 



In order to utilize the microphone array to selectively amplify sound coming 
from a particular source or sources, we have used a beamforming algorithm [4, 
24] on our tile parallel processor. Beamforming algorithms use the properties of 
sound propagation through space for sound source separation. Currently, we are 
using a delay-and-sum beamforming algorithm [24] , which is the simplest way of 
computing the beam. Delay-and-sum beamforming uses the fact that the delay 
for the sound wave to propagate from one microphone in the array to the next 
can be empirically measured or calculated from the array geometry. This delay 
is different for each direction of sound propagation, i.e., from the sound source 
position. By delaying the signal from each microphone by an amount of time 
corresponding to the direction of propagation and then summing the delayed 
signals, we selectively amplify sound coming from a particular direction. Sub- 
sample precision delays are handled by interpolation between the two adjacent 
integral sample values. Delay-and-sum beamforming assumes that the position 
of the desired source relative to the array is known. The problem of accurately 
localizing a source is crucial, but rather separate from the problem of amplifying
sound coming from a particular direction. For the work presented in this paper,
we assume that the position of the speaker is known in advance; however,
Section 7 outlines our plans to pursue work in source localization. 
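
To make the delay-and-sum idea concrete, the following is a minimal Python sketch of the technique with linear interpolation for sub-sample delays. It is not the code that runs on Raw: the function names, the list-based data layout, and the choice to align every channel to the longest propagation delay are assumptions made for illustration.

```python
import math


def sample_at(signal, index):
    """Read `signal` at a fractional index by linear interpolation.

    Positions outside the recording are treated as silence (0.0).
    """
    lo = math.floor(index)
    frac = index - lo

    def at(i):
        return signal[i] if 0 <= i < len(signal) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def delay_and_sum(channels, delays):
    """Delay-and-sum beamformer for one fixed focal point.

    channels: equal-length lists of samples, one list per microphone
    delays:   propagation delay from the focal point to each microphone,
              in (possibly fractional) samples
    """
    longest = max(delays)
    output = []
    for t in range(len(channels[0])):
        total = 0.0
        for signal, delay in zip(channels, delays):
            # Microphones nearer the source hear it earlier, so they are
            # delayed more; every channel then lines up at latency `longest`.
            total += sample_at(signal, t - (longest - delay))
        output.append(total / len(channels))
    return output


# Toy check: a slow sine wave reaching three microphones at different times.
def source(t):
    return math.sin(2.0 * math.pi * 0.01 * t)


delays = [2.0, 3.5, 5.25]
channels = [[source(t - d) for t in range(100)] for d in delays]
beam = delay_and_sum(channels, delays)
```

In this toy case the output reproduces the source delayed by the longest propagation delay. Sound arriving from other positions would not line up across channels and would therefore add incoherently.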

The delay-and-sum beamforming algorithm runs on the Raw microprocessor. 
The audio collected at each microphone in the array is streamed into one Raw 
static network input port. The data are streamed from tile to tile on the static 
network, with each tile's switch processor directing a portion of the data into the 
local processor. The processor then stores the data into memory, and retrieves 
the appropriately-delayed sample for each of its microphones. The running sum 
for the beamforming computation is passed along from tile to tile, also on the 
static network. The mapping of computation onto Raw tiles is illustrated in 
Figure 4. Twelve tiles are used for computation, one for delaying the output 
for debugging, one for bandpass-filtering the output, and one for formatting 
it properly for the output D-to-A converter (used for monitoring). One tile is 
unused due to I/O port placement constraints. 




Fig. 4. Beamforming algorithm mapping to Raw tiles. 



In order to enable this implementation, we have created a flexible framework 
for mapping computation to appropriate tiles, including scripts to automatically 
generate static network code for the Raw switch processor. We have optimized 
the code to run the beamforming algorithm for one fixed source position from 
all 1020 microphones in real time on one Raw chip (16 tiles). Due to current 
firmware constraints, we are able to run Raw only at 150 MHz; however, the Raw
chip can support clock rates upward of 400 MHz, thus in the future we could
easily accommodate more microphones, or more computation per microphone. 

4 Evaluation 

We have conducted preliminary experiments with the LOUD array. The exper- 
iments involved recording a person speaking in a room where several sources 
of noise were present. The room is a very noisy hardware laboratory. The main 
noise sources are several tens of cooling fans for computers and custom hardware, 
and a loud air conditioner. 

The subject reads a series of digit strings, and the speech is recorded with var- 
ious subsets of the LOUD microphone array and a high-quality noise-canceling 
close-talking microphone (Sennheiser HMD-410) (i.e., "clean" speech). In some
of the trials, another person serves as the "interferer," reading a text passage 
(the "Rainbow Passage" often used in speech recognition experiments) at the 
same time as the main speaker is speaking digit strings. The interferer scenario 
models a situation where several people in a room are talking, but we are inter- 
ested in recording the voice of only one person, such as in a conference or in a 
surveillance situation. 

As mentioned in Section 3.3, we have assumed a fixed position for our sub- 
ject. The amount of time required for the sound to travel from the position 
to each microphone in the array was determined in advance with the following 
procedure. A broadband "chirp" (frequency sweep) was played through a small 
loudspeaker located at the "focal point" - the desired point for amplification. A 
reference recording was obtained with a single microphone at the speaker. The 
signal received by each array microphone was also captured and stored on disk. The
data were then upsampled by a factor of 50 to obtain sub-sample precision in 
the calculation. A cross-correlation function (basically a dot-product at every 
possible time offset) was then calculated between the reference recording and 
the signal from each microphone. The time shift for which the function was at 
a maximum was taken as the propagation delay for that microphone. Anecdo- 
tally, this method was shown to be more accurate than calculating microphone 
delays from array and point geometry; probably due to some slop in microphone 
positions on the boards. 
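
The calibration procedure above can be sketched in Python as follows. This is a simplified illustration rather than the code used in the experiments: the text does not say how the upsampling was performed, so linear interpolation is assumed here, only non-negative lags are searched, and the synthetic windowed chirp and all function names are hypothetical.

```python
import math


def upsample_linear(signal, factor):
    """Upsample by an integer factor using linear interpolation (assumed method)."""
    out = []
    for a, b in zip(signal, signal[1:]):
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    out.append(signal[-1])
    return out


def estimate_delay(reference, recorded, factor=50, max_lag=8):
    """Delay of `recorded` relative to `reference`, in original samples.

    Evaluates the cross-correlation (a dot product at each candidate offset)
    on the upsampled signals and returns the offset with the largest value.
    """
    ref = upsample_linear(reference, factor)
    rec = upsample_linear(recorded, factor)
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag * factor + 1):
        score = sum(r * rec[i + lag] for i, r in enumerate(ref[:len(rec) - lag]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / factor


# Synthetic stand-in for the broadband chirp: a windowed frequency sweep.
def chirp(t):
    u = t - 28.0
    envelope = math.exp(-(u / 6.0) ** 2)
    return envelope * math.sin(2.0 * math.pi * (0.1 * u + 0.0015 * u * u))


reference = [chirp(t) for t in range(64)]
recorded = [chirp(t - 3.5) for t in range(64)]  # true delay: 3.5 samples
delay = estimate_delay(reference, recorded, factor=10)
print(f"estimated delay: {delay} samples")
```

The upsampling factor sets the resolution of the estimate: a factor of 50 at a 16 kHz sampling rate corresponds to steps of 1.25 microseconds.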

The layout of the experimental setup is as follows. The array is positioned on 
a counter top 145 cm from the ground. The "focal point" of the array is located
in line with the left edge of the array, or 88.5 cm to the left of the center of the 
array, 137 cm in front of the array, and 25 cm above the bottom row of the array. 
The interferer stands at a mirror image point in line with the right edge of the 
array, or 88.5 cm to the right of center, and 137 cm in front.

The data from the microphone array are streamed into the Raw microprocessor
as described in Section 3.1, and are stored in the 2 GB of off-chip DRAM
currently available to the processor. Once the audio streams are stored in memory,
the processor performs a delay-and-sum beamforming (see Section 3.3) run
for 23 different microphone sizes/configurations, ranging from one microphone
to all 1020 microphones. This process simulates simultaneously recording the
same audio stream with arrays of varying sizes, allowing us to compare the effect
of the number of microphones on array performance. The microphones are
fect of the number of microphones on array performance. The microphones are 
first taken from the bottom row in powers of two (1, 2, 4, 8, 16, 32), and then 
row by row all the way to the top of the array (each row adds 60 microphones - 
60, 120, 180, ..., 1020). 
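
The 23 virtual array sizes described above can be enumerated with a short Python sketch (the names are illustrative only):

```python
MICS_PER_ROW = 60   # microphones in one row of the array
ROWS = 17           # rows in the array


def array_configurations():
    """Microphone counts for each virtual array, smallest first."""
    # Powers of two taken from the bottom row: 1, 2, 4, 8, 16, 32.
    partial_row = [2 ** k for k in range(6)]
    # Then whole rows, bottom to top: 60, 120, ..., 1020.
    full_rows = [MICS_PER_ROW * r for r in range(1, ROWS + 1)]
    return partial_row + full_rows


configs = array_configurations()
print(len(configs), configs)
```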

The output of the beamforming algorithm for each virtual array configuration
is processed with a band-pass filter set to pass through frequencies between
300 Hz and 3,500 Hz, in order to eliminate low-frequency noise and high-frequency
content that would be aliased when downsampling to 8 kHz (as needed for
recognition). Finally the output waveforms are streamed from the chip to a desktop
host machine via a static network port. The resulting waveforms are then written
to disk on the host machine. Due to current host interface constraints,
it was not practical to record the output of all 1020 microphones; however, this
capability will soon be in place, allowing us to record a corpus of large-array
recordings (see Section 7).

In order to obtain an initial metric of array performance, we measured the 
signal-to-noise ratio (SNR) of the beamformer output. The SNR for an utterance 
can be defined in many ways. When a precise recording of the original source 
audio is available, the SNR can be calculated as the variance between the source 
signal and the noisy signal. When such a recording is not available, or cannot
be reliably made, the "peak" SNR can be approximated by taking the maximum 
signal power over a time window during the speech segments of the waveform 
as signal, and the signal power over a non-speech segment as the noise. While a 
reference recording was available to us from the close-talking microphone, it was 
still somewhat noisy; thus we chose to use the second method. However, since 
the method required hand-segmenting the beamformed audio for the 23 different
microphone configurations (to find speech and non-speech components), we
obtained SNR figures for only one representative recording.
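
A minimal Python sketch of the "peak" SNR estimate described above is given below. The window length of 400 samples (25 ms at 16 kHz), the use of non-overlapping windows, and the function names are assumptions made for illustration; the text does not specify these details.

```python
import math


def mean_power(samples):
    """Average power (mean squared amplitude) of a block of samples."""
    return sum(x * x for x in samples) / len(samples)


def peak_snr_db(signal, speech_segments, noise_segment, window=400):
    """Approximate peak SNR in dB from hand-labelled segments.

    speech_segments: list of (start, end) sample indices containing speech,
                     each at least `window` samples long
    noise_segment:   (start, end) sample indices containing no speech
    """
    peak = 0.0
    for start, end in speech_segments:
        # Maximum windowed power over all speech segments is the "signal".
        for w in range(start, end - window + 1, window):
            peak = max(peak, mean_power(signal[w:w + window]))
    noise = mean_power(signal[noise_segment[0]:noise_segment[1]])
    return 10.0 * math.log10(peak / noise)


# Toy example: 800 samples of low-level noise followed by 800 of "speech".
toy = [0.1 * (-1) ** i for i in range(800)] + [1.0 * (-1) ** i for i in range(800)]
snr = peak_snr_db(toy, speech_segments=[(800, 1600)], noise_segment=(0, 800))
print(f"peak SNR: {snr:.1f} dB")
```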

The evaluation portion of our experiment consisted of running the output of 
the beamforming algorithm through the MIT SUMMIT recognizer [25], which is 
a feature-based finite state transducer speech recognizer created at the Spoken 
Language Systems group in our laboratory. The recognizer was trained on a com- 
bination of clean and noisy speech from the Aurora digits corpus [26] . The Aurora 
corpus is based on over 4,000 samples of humans reading digit strings recorded 
with a close-talking microphone from [27]. The clean data are augmented by 
synthetically adding noise from typical environments (e.g., train, babble, car, 
and exhibition hall) at various SNRs to the utterances to simulate noisy data. 
The simulated noisy speech in combination with the clean speech constitutes 
over 28,000 utterances, which are all used to train the SUMMIT recognizer. We 
note that due to the channel differences between the close-talking microphone 
used to record the Aurora data and the LOUD microphone array, it is unrealis- 
tic to expect the array test data to match recognition rates given in the Aurora 
literature.

For this initial round of experiments, we recorded 150 utterances from two
male speakers with an interferer, and 110 utterances from the same speakers
without interferers. The data for the close-talking microphone were collected 
as 80 utterances with an interferer at the same time as the array experiments, 
in order to provide a baseline for the speech recognition experiments. In the 
interferer trials, the person not serving as the subject served as the interferer.
Certainly, much more extensive testing is necessary in order to evaluate the
microphone array in sufficient detail; this work is ongoing (see Section 7).

The amplification pattern of a microphone array beamformer consists of a
main amplification lobe and a number of smaller side lobes. The width of the
main lobe quantifies the precision of the beam. A metric known as half-power
beam width (HPBW) [4] measures the distance from the focus of the beam at
which the amplification strength of the beam drops off by a factor of two (3 dB).
To measure HPBW, we played a sine wave from a one-inch diameter portable
speaker while moving it around in space. At the same time we continually ran
the power calculation code on Raw in conjunction with the beamformer. In this
fashion, we were able to obtain some preliminary measurements of the HPBW 
of our array. 



5 Results 

Figure 5 gives approximate peak SNRs in dB for a representative utterance, dis- 
playing the trend of improvement as the number of microphones is increased. The 
close-talking microphone, with an SNR level of 35.0 dB, serves as the baseline.
The SNR improves from 17.2 dB with one microphone to 30.9 dB with all 1020
microphones. This 13.7 dB improvement corresponds to a roughly 23-fold improvement
in the ratio of signal energy to noise energy.
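
Decibel figures such as these can be converted to linear ratios with the standard definitions: a difference of x dB corresponds to a factor of 10^(x/10) in power or energy, and 10^(x/20) in amplitude. The Python sketch below applies both conversions to the SNR values quoted above; the function names are illustrative.

```python
def db_to_power_ratio(db):
    """Linear power (energy) ratio corresponding to a dB difference."""
    return 10.0 ** (db / 10.0)


def db_to_amplitude_ratio(db):
    """Linear amplitude ratio corresponding to a dB difference."""
    return 10.0 ** (db / 20.0)


improvement_db = 30.9 - 17.2  # SNR with 1020 microphones minus SNR with one
print(f"improvement:     {improvement_db:.1f} dB")
print(f"power ratio:     {db_to_power_ratio(improvement_db):.1f}")
print(f"amplitude ratio: {db_to_amplitude_ratio(improvement_db):.2f}")
```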

Speech recognition quality is typically evaluated based on the word error
rate (WER), meaning the percentage of words that were recognized incorrectly.
In this work, we mostly give results in terms of the accuracy rate, which is simply
calculated as (100% - WER). Figure 6 and Table 1 give the accuracy rates for
the experimental data that we have collected, for all the array sizes ranging
from one microphone to all 1020. Our baseline accuracy is for a close-talking
microphone, at 98.8%. This is consistent with results from the Aurora corpus
when tested on clean speech [26], meaning that our speech recognizer performs on
a level consistent with other state-of-the-art recognizers. For one microphone, the
accuracy is below 10% both with and without an interferer, meaning acceptable
recognition is impossible to achieve. The accuracy rises above 50% around 60
microphones (one full row of the array); a reasonable recognition hypothesis
can sometimes be made at this level. All 1020 microphones yield 87.6% accuracy
(87.2% drop in word error rate) in the presence of an interferer and 90.6% (89.6%
drop in WER) without an interferer.
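
The quoted drops in word error rate are relative reductions, which can be recomputed from the accuracy figures given in the text. A small Python sketch (the function name is illustrative):

```python
def relative_wer_drop(accuracy_before, accuracy_after):
    """Relative reduction in WER (percent), given accuracy rates in percent."""
    wer_before = 100.0 - accuracy_before
    wer_after = 100.0 - accuracy_after
    return 100.0 * (wer_before - wer_after) / wer_before


with_interferer = relative_wer_drop(3.0, 87.6)   # one microphone vs. 1020
no_interferer = relative_wer_drop(9.6, 90.6)     # one microphone vs. 1020
print(f"interferer present: {with_interferer:.1f}% drop in WER")
print(f"no interferer:      {no_interferer:.1f}% drop in WER")
```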

Our measurements of the HPBW of the array show that when listening to a
1 kHz source, the energy level drops off by half when the source moves 5 inches
horizontally, or 10 inches vertically from the point the algorithm is amplifying.
This result is consistent with the 60x17 array geometry because the array has
a greater spatial extent horizontally than vertically. However the amplification
pattern is different for each frequency of sound, and speech is a broadband signal.
Thus it is difficult to quantify exactly how well the beamformer would be able
to separate two speakers at arbitrary positions in the space.

Fig. 5. Peak SNRs for one representative recording from the microphone array.

6 Discussion 



The results in Figures 5 and 6 clearly demonstrate there is indeed a benefit to 
having arrays of this large size. In this work, we focused on the design of the
system, and did not implement sophisticated beamforming algorithms or other
signal processing software components. However, even with the simplest beam- 
former possible, we were able to obtain increasing SNRs and gains in recognition 
accuracy all the way to 1020 microphones. This is perhaps the most significant 
result. 

The most drastic jump in the recognition accuracy curve is seen when the 
number of microphones jumps from 32 to 60, most likely because this completes 
the full line of the array (60 microphones), making the beam width almost twice 
as narrow as with 32 microphones. After this point, adding more microphones 
does not make the array wider, just taller. We note that the accuracy even with 
1020 microphones (87.6% and 90.6%) still falls well short of the 98.8%
baseline from the close-talking microphone; and this is consistent with the SNRs
noted in the recordings. However, with more complicated signal processing and
beamforming algorithms and a better match between the recognizer training and
test conditions (see Section 4), we are confident that the recognition accuracy of
audio recorded with the array can approach that of a close-talking microphone.
Extrapolating the trends in Figures 5 and 6, one can imagine array performance
eventually reaching close-talking microphone levels.

Fig. 6. Experimental results from the LOUD microphone array. Results are given for
data recorded with the array both in the presence of an interferer and when the
interferer is absent. The baseline level of 98.8% accuracy is given when using
a high-quality close-talking microphone.

Comparison with past work is difficult for several reasons. One reason is differences
in experimental conditions. Our data were collected in a very noisy environment,
likely noisier than those of most currently published results. For instance,
[14] cites an accuracy rate of 42.4% with a single omnidirectional microphone 
and one interferer, compared to our 3.0%; [23] is at 58%; and [16] is at 33%. 
While SNR is one intuitive way of comparing noise levels, it is actually difficult 
to compare based on SNR, since the various methods for determining SNR can 
produce very different results. Some of the previous work uses a form of average 
SNR over a hand-segmented waveform; however in our experiments, due to time 
constraints, it was impossible to perform such an analysis. Another alternative is 
to use a speech recognizer to segment the waveform and then calculate average 
SNRs; however this approach is often inaccurate because recognizer segmenta- 
tion will be poor at low SNRs. In addition, there is much opinion that SNRs 
may not be a good measure of speech quality at all [28, 23]. 


                                                          Accuracy Rate
Number of Microphones   Peak Signal-to-Noise Ratio (dB)   Interferer Present   No Interferer
1                       17.2                              3.0%                 7.1%
2                       16.7                              3.2%                 6.0%
4                       17.9                              3.6%                 6.7%
8                       18.5                              5.7%                 11.2%
16                      20.7                              9.5%                 19.6%
32                      21.0                              14.7%                30.7%
60                      22.1                              54.0%                70.0%
120                     23.4                              61.8%                74.6%
180                     24.4                              65.5%                81.1%
240                     25.3                              69.6%                83.8%
300                     26.6                              72.4%                86.3%
360                     26.1                              75.8%                86.8%
420                     27.2                              77.1%                87.5%
480                     27.6                              78.6%                87.7%
540                     28.2                              80.8%                88.3%
600                     28.6                              82.0%                88.9%
660                     29.6                              83.2%                89.6%
720                     29.6                              83.9%                89.5%
780                     30.0                              84.7%                89.8%
840                     30.2                              85.1%                90.3%
900                     30.4                              86.0%                90.8%
960                     30.5                              87.0%                90.8%
1020                    30.9                              87.6%                92.0%
Close-talking baseline  35.0                                            98.8%

Table 1. Recognition accuracy rates and signal-to-noise ratios with various test 
conditions 
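
As a quick check on how the table relates to the headline figures in the abstract, the following sketch recomputes the peak SNR gain and the relative drop in word error rate for the interferer-present condition from the 1-microphone and 1020-microphone rows. It assumes that word error rate is simply 100% minus the accuracy rate, which is a simplification made here for illustration; the function names are likewise illustrative.

```python
def snr_gain_db(snr_small_db: float, snr_large_db: float) -> float:
    """Difference in peak SNR (dB) between two array sizes."""
    return snr_large_db - snr_small_db


def relative_wer_drop(acc_small_pct: float, acc_large_pct: float) -> float:
    """Relative drop in word error rate, in percent.

    Assumption: WER = 100 - accuracy (both in percent).
    """
    wer_small = 100.0 - acc_small_pct
    wer_large = 100.0 - acc_large_pct
    return 100.0 * (wer_small - wer_large) / wer_small


# Values taken from the 1-microphone and 1020-microphone rows,
# interferer-present column.
gain = snr_gain_db(17.2, 30.9)
drop = relative_wer_drop(3.0, 87.6)
print(f"peak SNR gain: {gain:.1f} dB, relative WER drop: {drop:.1f}%")
```

Under this assumption the two values come out to roughly 13.7 dB and 87.2%, matching the figures quoted in the abstract for the interferer-present case.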



Comparison with past results is made more difficult still by the fact that 
most published speech recognition experiments with microphone arrays use much 
smaller arrays than ours. 
For instance, in his 1996 PhD thesis, Sullivan [16] writes, 

... the [recognition] improvement appears to level off at about 8 sensors. 
This suggests that there may be other factors involved in word errors that 
the type of processing provided by microphone arrays cannot correct. In 
our experiments it is fair to say that the amount of extra computation 
and hardware needed to process additional microphones beyond 4-8 mi- 
crophones appears to exceed the benefits that one obtains from including 
them. 

The work reports that recognition accuracy levels off at around 60% after 
eight microphones. In our work, we were able to achieve recognition levels well 
in excess of this; perhaps the discrepancy can be attributed to differences in 
speech recognition technology (the work reported by Sullivan was published eight 
years ago). Indeed, in a later work, Adcock [23] was able to achieve recognition 
accuracy rates above 80% for noisy speech recorded with a 16-microphone array, 
but starting at 58% for one microphone, as compared to 33% in Sullivan. It is 
unknown whether the accuracy rates in Adcock would have continued improving 
with larger array sizes. Our work starts at recognition rates below 10% for one 
microphone, and accuracy does not exceed the 50% mark until 60 microphones 
are used, clearly demonstrating a difference in noise conditions. 

In general, it is probably fair to note that prevailing opinion in the signal 
processing community has made research on large arrays seem impractical and 
of little benefit. We submit that by demonstrating a 
consistent improvement with increasing array sizes, we have shown that, at least 
at the current time, this belief is not entirely accurate. While our accuracy rates 
even with a 1020-microphone array fall short of those of recordings made with 
a high-quality close-talking microphone, we believe that this merely serves to 
motivate future research in adaptive beamforming and other advanced sound 
source selection techniques (see Section 7). 

Regarding the hardware needed for the real-time processing of data from a 
large microphone array, we have also shown that using a parallel tile architec- 
ture is both sufficient and practical for arrays of this size. Past work in large 
arrays (e.g., [6]) has used special-purpose DSP chips. In contrast, with our use 
of the Raw microprocessor, we have followed the recent trend in the computer 
architecture field to leverage general-purpose tiled architectures to increase both 
computation density and ease of programming. 

7 Conclusion 

In this work we have presented LOUD, a 1020-node microphone array and beam- 
former for intelligent computing spaces. We have outlined our design of the array 
hardware and software architecture, and our utilization of the Raw tile parallel 
processor for computation. In addition, we have presented an analysis of exper- 
imental data collected with the array. The data were captured with the array, 
processed with a beamforming algorithm, and evaluated by means of running 
a speech recognizer on the beamformed data. We have presented experimental 
signal-to-noise ratios and recognition accuracy scores for 23 different array sizes, 
ranging from one to 1020 microphones, showing a steady improvement all the 
way to 1020 microphones. We believe that with these results we have made the 
case that large microphone arrays deserve a thorough investigation both by the 
ubiquitous computing and signal processing communities. 

We are pursuing several directions for future work in this project. First, we 
are continuing work to evaluate the microphone array's usefulness in ubicomp 
scenarios. The experiments presented in this work are only a first step towards 
a full evaluation of the system. In order to prove the applicability of our micro- 
phone array in ubiquitous computing, we plan to conduct user studies in which 
the subjective quality of the speech produced by the array can be measured on 
communication tasks such as tele-conferencing. 
In addition, in order to evaluate the quantitative performance of the array 
further, more speech data will be collected from more speakers, at more positions, 
in different noise environments, with different array configurations and spacings, 
etc. Our group is currently working on increasing the speed of data collection 
by improving the USB 2.0 interface between the host computer and the Raw 
motherboard. Improved speed here will allow us to record data from all 1020 
microphones, rather than from only 23 beamformed channels. Once this work 
is completed, the data collection effort will be greatly simplified, since the 
requirement to run beamforming on the data at collection time will be eliminated. 
The data will simply be compressed on the Raw chip and shipped off to storage. 
Once we are able to quickly record data from all 1020 microphones, we will 
be able to produce an array microphone recording corpus, which we will make 
available for public distribution on the web. Other array corpora have been made 
available in the past, but only with a smaller number of microphones (e.g., [29] 
has data from 37 and 23 microphones). 

Improved host interface speeds will enable us to perform experiments on the 
data that currently run slower than real-time. For instance, we plan to 
experiment with "tracking" speakers as they move in the space around the array. 
This can be accomplished by repeatedly perturbing the beam position, or delay 
values, and moving in the direction of maximum energy over a time window 
(essentially a gradient ascent approach). This approach is a type of adaptive 
beamforming, that is, beamforming that adapts as aspects of the environment 
change. Even with our current stationary experimental setup, it would be 
useful to have this algorithm in place: since it is impossible for the speaker to 
stand in precisely the same place from trial to trial, we would start the beam at 
the point corresponding to our initial delay measurements, and then search for 
the speaker in the surrounding space. The position of maximum energy 
likely corresponds to the position of the speaker's mouth. This approach can 
be used generally to search for speakers in a room, although there are several 
difficulties. For instance, if the speaker is silent over several energy windows, the 
beam could wander away and actually amplify a noise source. Or, if two speakers 
are present and both talking, it may be hard to keep tracking one of them 
consistently. In order to compensate for these difficulties, this method can 
be augmented with information from other modalities. For instance, a group in 
our lab [19] has used computer vision algorithms for person detection 
using stereo cameras. The vision-based person location was used as a first-order 
approximation for a 32-microphone array beamformer, which attempted to recognize 
speech in the presence of noise and interferers. This multimodal sensor fusion 
fits in well with the goals of our ubiquitous computing research under Project 
Oxygen, and we plan to explore this approach with the LOUD array. 
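
The perturb-and-move search described above can be sketched as a simple hill climb over beam positions. This is only an illustration under stated assumptions, not the implementation planned for LOUD: the function `track_beam`, the callback `window_energy`, the step size, and the synthetic energy surface are all hypothetical, and a discrete perturbation search stands in for a true gradient computation. In a real system, `window_energy` would return the energy of the beamformed output over one time window for a given beam position.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]


def track_beam(start: Point,
               window_energy: Callable[[Point], float],
               step: float = 0.05,
               rounds: int = 50) -> Point:
    """Perturb the beam position and follow the direction of maximum energy."""
    pos = start
    for _ in range(rounds):
        # Current position first, so ties keep the beam where it is.
        candidates = [pos] + [
            (pos[0] + dx, pos[1] + dy)
            for dx, dy in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step))
        ]
        best = max(candidates, key=window_energy)
        if best == pos:
            break  # no perturbation increases the energy
        pos = best
    return pos


# Synthetic stand-in for beamformed energy: a single peak at the
# (hypothetical) speaker position, in metres.
speaker = (1.0, 0.5)


def synthetic_energy(p: Point) -> float:
    return 1.0 / (1.0 + (p[0] - speaker[0]) ** 2 + (p[1] - speaker[1]) ** 2)


# Start from an initial estimate, as from the initial delay measurements.
estimate = track_beam((0.8, 0.8), synthetic_energy)
print(estimate)
```

With a single smooth energy peak the search settles within one step of the peak. The difficulties noted above (a silent speaker, or two simultaneous talkers) correspond to an energy surface with no peak or with several, where a local search of this kind can wander or lock onto the wrong source.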

We are pursuing several target applications for microphone arrays within 
Project Oxygen. We plan to integrate the microphone array into our ubicomp 
research spaces, such as our lab's kiosk platform for human-computer interaction 
experiments [30]. A microphone array will greatly increase the usability (and 
hopefully, the use) of our kiosks; the current method is to ask the user to wear a 
close-talking microphone headset, which is bulky and inconvenient for passers-by 
wishing to have only a short interaction with the kiosk. 

Another aspect that we plan to consider in the future is the effect that room 
acoustics and environmental conditions (other than noise) have on array 
performance. Array performance can be affected by reverberations and distortions 
due to the room in which the array is located. This effect is likely present in our 
work, since the array is currently located in a cluttered hardware lab with many 
hard surfaces. We plan to measure array performance with different room 
configurations in order to understand the effect of these factors. In addition, when 
microphone delays are calculated from array geometry (as is usually the case 
in beamforming), a particular value for the speed of sound must be assumed. 
However, the speed of sound varies with room temperature and humidity; for 
instance, a five-degree Celsius variation in temperature changes sound speed by 
about 3 m/s. The accuracy of the calculations, and thus the performance 
of the array, can vary as the temperature changes. 
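
To give a sense of scale, the sketch below estimates how the assumed temperature affects a geometry-derived delay. It uses the standard dry-air approximation for the speed of sound and ignores humidity; the 3 m source-to-microphone distance and the 16 kHz sampling rate are illustrative assumptions, not values taken from our setup.

```python
import math


def speed_of_sound(temp_c: float) -> float:
    """Approximate speed of sound in dry air (m/s) at temp_c degrees Celsius."""
    return 331.3 * math.sqrt(1.0 + temp_c / 273.15)


def delay_samples(distance_m: float, temp_c: float, sample_rate_hz: float) -> float:
    """Propagation delay over distance_m, expressed in samples."""
    return distance_m / speed_of_sound(temp_c) * sample_rate_hz


c20 = speed_of_sound(20.0)
c25 = speed_of_sound(25.0)

# Assumed values for illustration only: 3 m path, 16 kHz sampling.
shift = delay_samples(3.0, 20.0, 16000.0) - delay_samples(3.0, 25.0, 16000.0)

print(f"c(20 C) = {c20:.1f} m/s, c(25 C) = {c25:.1f} m/s")
print(f"delay shift over 3 m at 16 kHz: {shift:.2f} samples")
```

Under these assumptions a five-degree change shifts the delay by roughly one sample, which is enough to misalign channels in a delay-based beamformer if the assumed speed of sound is not updated.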